The other day I came across a small problem: I was investigating a dataset, and the variables clearly showed a non-linear relationship. Consequently, I could not apply classical linear regression.
An approach to solve this kind of problem is LOESS regression, which stands for locally weighted scatterplot smoothing.
In the following, we will work through an artificial example to make the concept clear and show how to apply it in R.
We first create a data frame with two variables, a and y, which are non-linearly related. The plot shows this relationship.
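This is the create_data chunk from the full code at the end of the post; since the noise is random, the exact values will differ between runs:

y <- seq(from=1, to=10, length.out=100)
a <- y^3 + y^2 + rnorm(100, mean=0, sd=30)   # non-linear relationship plus noise
data <- data.frame(a=a, y=y)
plot(y=y, x=a)                               # scatterplot of the raw data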
We perform two regressions, one linear and one loess.
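Both fits are one-liners: lm() fits the linear model and loess() the local regression with its defaults (span = 0.75, degree = 2, as the summary below confirms):

linreg <- lm(y ~ a)
summary(linreg)
loess <- loess(y ~ a)
summary(loess)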
## 
## Call:
## lm(formula = y ~ a)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.303 -0.635  0.294  0.792  1.635 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.100335   0.136374    22.7   <2e-16 ***
## a           0.007645   0.000305    25.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.973 on 98 degrees of freedom
## Multiple R-squared:  0.865,  Adjusted R-squared:  0.864 
## F-statistic:  630 on 1 and 98 DF,  p-value: <2e-16
## Call:
## loess(formula = y ~ a)
## 
## Number of Observations: 100 
## Equivalent Number of Parameters: 4.56 
## Residual Standard Error: 0.587 
## Trace of smoother matrix: 4.98 
## 
## Control settings:
##   normalize:  TRUE 
##   span     :  0.75 
##   degree   :  2 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2
The scatterplot clearly shows the better fit from the loess regression.
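The plot is created with scatter.smooth(), which draws the points together with the LOESS smoother, and abline(), which adds the linear fit for comparison:

scatter.smooth(data)          # points plus LOESS curve
abline(linreg, col="blue")    # linear fit in blue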
The summaries show one difference between linear and LOESS regression: the LOESS summary does not report regression coefficients, since it is a local regression. The classical measure of goodness of fit is R².
The linear model has an R² of 0.865. For the LOESS model we have to calculate the R² ourselves; a common proposal is to use the squared correlation between the observed and the fitted values.
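With the fitted values from predict(), the calculation is a one-liner:

hat <- predict(loess)            # fitted values of the LOESS model
(r_sq_loess <- cor(y, hat)^2)    # squared correlation between observed and fitted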
## [1] 0.9528
The R² of the LOESS regression is 0.953 and thus very good, and better than the R² of the linear regression.
The plot of the residuals versus the fitted/predicted values also indicates a much better fit of the LOESS regression compared to the linear regression: the residual plot of the linear regression shows a clear structure, which we do not want. The QQ-plots show some issues in the tails.
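The diagnostic plots are created with the standard base-graphics functions, arranged in a 2x2 grid:

par(mfrow=c(2,2))
plot(linreg$fitted, linreg$residuals, main="classical linear regression")
plot(loess$fitted, loess$residuals, main="LOESS")
# normal probability plots of the residuals
qqnorm(linreg$residuals, ylim=c(-2,2))
qqline(linreg$residuals)
qqnorm(loess$residuals, ylim=c(-2,2))
qqline(loess$residuals)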
But what do we do with it now? Finally, we probably want to predict some values from this exercise.
The predict function helps us with this exercise.
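We build a small data frame with three new values of a and ask both models for predictions (the data frame is called predict in the full code below, which works but shadows the name of the predict() function):

predict <- data.frame(a=c(10, 400, 900))       # new values of a
predict$linreg <- predict(linreg, predict)     # predictions of the linear model
predict$loess <- predict(loess, predict)       # predictions of the LOESS model
predict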
## a linreg loess
## 1 10 3.177 2.265
## 2 400 6.158 7.133
## 3 900 9.981 9.365
Alright, this was a quick introduction to applying LOESS regression in R.
And here is the full code:
## @knitr create_data
y <- seq(from=1, to=10, length.out=100)
a <- y^3 + y^2 + rnorm(100, mean=0, sd=30)
data <- data.frame(a=a, y=y)
plot(y=y, x=a)

## @knitr linreg
linreg <- lm(y ~ a)
summary(linreg)
loess <- loess(y ~ a)
summary(loess)
scatter.smooth(data)
abline(linreg, col="blue")

## @knitr loess_fit
hat <- predict(loess)
plot(y ~ a)
lines(a[order(a)], hat[order(a)], col="red")   # LOESS fit, ordered by a
(r_sq_loess <- cor(y, hat)^2)

## @knitr fit_loess
par(mfrow=c(2,2))
plot(linreg$fitted, linreg$residuals, main="classical linear regression")
plot(loess$fitted, loess$residuals, main="LOESS")
# normal probability plot
qqnorm(linreg$residuals, ylim=c(-2,2))
qqline(linreg$residuals)
qqnorm(loess$residuals, ylim=c(-2,2))
qqline(loess$residuals)

## @knitr predict
predict <- data.frame(a=c(10,400,900))
scatter.smooth(data)
abline(linreg, col="blue")
predict$linreg <- predict(linreg, predict)
predict$loess <- predict(loess, predict)
predict
points(x=predict$a, y=predict$linreg, col="blue", pch=18, cex=2)
points(x=predict$a, y=predict$loess, col="red", pch=18, cex=2)